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CDS T cells are specialized immune cells that play an important 
role in the regulation of antiviral immune response and the genera- 
tion of protective immunity. In this paper we investigate the differ- 
entiation of memory CDS T cells in the immune response using a 
short time course microarray experiment. Structurally, this experi- 
ment is similar to many in that it involves measurements taken on 
independent samples, in one biological group, at a small number of 
irregularly spaced time points, and exhibiting patterns of temporal 
nonstationarity. To analyze this CDS T-cell experiment, we develop a 
hierarchical state space model so that we can: (1) detect temporally 
differentially expressed genes, (2) identify the direction of successive 
changes over time, and (3) assess the magnitude of successive changes 
over time. We incorporate hidden Markov models into our model to 
utilize the information embedded in the time series and set up the 
proposed hierarchical state space model in an empirical Bayes frame- 
work to utilize the population information from the large-scale data. 
Analysis of the CDS T-cell experiment using the proposed model re- 
sults in biologically meaningful findings. Temporal patterns involved 
in the differentiation of memory CDS T cells are summarized sep- 
arately and performance of the proposed model is illustrated in a 
simulation study. 

1. Introduction. 

1.1. Background. Time course microarray experiments iiave become in- 
creasingly popular in tiie study of dynamic biological processes due to their 
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ability to monitor tens of thousands of genes over time. As of June 2004 
[Ernst, Nau and Bar- Joseph (2005)], 80% of time course microarray exper- 
iments were short time series with fewer than eight time points, according 
to the Stanford Microarray Database [Gollub et al. (2003)]. In this paper 
we are particularly interested in a short time course microarray experiment 
on memory CDS T cell differentiation originally described and analyzed in 
Kaech et al. (2002a). This CDS T-cell experiment was done in the context 
of a large research effort to understand immune memory in Rati Ahmed's 
laboratory of the Emory Vaccine Research Center. Here immune memory 
refers to the ability of the immune system to remember its first exposure to 
a specific antigen and to mount a rapid and aggressive response to a second 
exposure. In the immune system, CDS T cells are specialized immune cells 
that play an important role in the regulation of antiviral response and the 
generation of protective immunity. In response to a viral infection, naive CDS 
T cells differentiate into effector CDS T cells that control the infection and 
the effector CDS T cells that survive continue to differentiate into long-lived 
protective memory CDS T cells [Kaech, Wherry and Ahmed (2002b)]. 

In this CDS T-cell experiment, acute lymphocytic choriomeningitis virus 
Armstrong (LCMV) infection of mice was used as a model system to study 
memory CDS T cell differentiation. Genetically identical, uninfected mice 
were sacrificed on the baseline day (nai've) to obtain naive CDS T cells. 
Other genetically identical mice were infected with LCMV on the baseline 
day. Then mice were sacrificed at day S (dS) and day 15 (dl5) to obtain 
effector CDS T cells, and at greater than day 30 (Imm) to obtain memory 
CDS T cehs. Affymetrix MG-U74AV2 arrays were used to measure 12,48S 
genes in P14 CDS T cells from mouse spleens at these four time points. For 
each chip, cells from at least three mice were pooled to obtain sufficient RNA 
for MG-U74AV2 hybridization. Structurally, this CDS T-cell experiment is 
similar to many in that it involves measurements taken on different mice 
(independent sampling), in one biological group, at a small number of irreg- 
ularly spaced time points, and exhibiting patterns of temporal nonstation- 
arity. The goal of the analysis is to assist investigators in understanding the 
underlying system biology by identifying temporally differentially expressed 
(TDE) genes and characterizing temporal changes involved in memory CDS 
T cell differentiation. 

In the original analysis, Kaech et al. (2002a) selected genes based on 
whether their average gene expression levels changed (decreased or increased) 
at any time point by at least 1.7 fold (original linear scale) compared to the 
first time point, generating a set of 431 genes. They applied a K-means clus- 
tering algorithm on the selected genes and found six major patterns out of 
10 clusters. In this original analysis, the temporal aspects of the data were 
ignored and both the fold change cutoff in the selection method and the 
number of clusters in K-means clustering were chosen arbitrarily. Although 
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they identified several individual genes known to be important in differenti- 
ating naive, effector and memory CDS T cells, the biological meaning of the 
obtained clusters is not clear and the interpretation of clustering results is 
not straightforward. To improve interpretation, we re-analyzed this CDS T- 
cell experiment by focusing on the direction (upregulation, downregulation 
and no change) and the magnitude of the gene-specific successive differences 
(changes) in the mean gene expression levels (log base 2 scale) over time. 

1.2. Analysis of time course microarray data. Methods up to now for 
time course microarray experiments can be divided into two classes: (1) 
methods extended from those for static microarray experiments, and (2) 
methods extended from time series analysis. Examples of the first type of 
methods include hierarchical clustering [Eisen et al. (199S) and Spellman et al 
(199S)], K-means clustering [Tavazoie et al. (1999), Kaech et al. (2002a) 
and Bar-Joseph et al. (2002)], self-organizing maps [Tamayo et al. (1999)], 
singular value decomposition [Alter, Brown and Botstein (2000) and 
Wall, Dyck and Brettin (2001)], ANOVA-based analysis [Park et al. (2003)] 
and pairwise analysis. Ignoring the temporal aspects of data, these types of 
methods have the potential to suffer from low sensitivity. Examples of meth- 
ods inspired by time series analysis include Auto-Regression (AR) based 
models, multivariate Normal models, B-splines based models, and hidden 
Markov models (HMMs). Ramoni, Sebastiani and Cohen (2002) represented 
gene expression sequences as stationary series produced from a finite num- 
ber of AR processes and applied an agglomerative Bayes clustering al- 
gorithm to search gene clusters. Using the multivariate Normal distribu- 
tion, Tai and Speed (2006) developed a multivariate hierarchical empirical 
Bayes model to identify TDE genes. Storey et al. (2005) proposed a gen- 
eral framework for time course microarray experiments by modeling a gene- 
specific expression sequence as a linear expansion of B-spline basis func- 
tions. Hong and Li (2006) proposed to add a hierarchical structure into a B- 
spline based model to identify genes temporally differentially expressed be- 
tween two biological groups [see Spellman et al. (199S), Klevecz (2000) and 
Heard, Holmes and Stephens (2006) for other choices for basis functions]. 
Schliep, Schonhuth and Steinhoff (2003, 2004) proposed to use a mixture of 
hidden Markov models to model the gene expression level sequence and to 
cluster genes where a heuristic cluster deletion and splitting procedure is 
utilized to determine the number of clusters. Yuan and Kendziorski (2006) 
constructed a hidden Markov model to infer the gene-specific relative rela- 
tionship of multiple biological groups over time. Zhou and Wakefield (2006) 
proposed a first-order random walk model to partition TDE genes where 
the optimal number of partitions is chosen based on posterior probabilities 
via birth-death MCMC. Although these methods are useful in the analysis 
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of time course microarray experiments, none of them can be used to inves- 
tigate both the direction and the magnitude of successive changes in the 
mean gene expression levels in the CDS T-cell experiment. 

To add to these useful methods, we develop a hierarchical state space 
model using the observed gene expression levels along with the direction 
and the magnitude of successive changes in the mean gene expression lev- 
els. To improve the sensitivity of detecting TDE genes, we incorporate 
HMMs into our model to utilize the correlation information embedded in 
the time series [Yuan and Kendziorski (2006)] and set up our model in an 
empirical Bayes framework to utilize information from the gene population 
[Efron et al. (2001) and Newton et al. (2001)]. Analysis of the CDS T-cell 
experiment using the proposed hierarchical state space model results in the 
identification of more significant genes (over 1200 vs 431 in the original 
analysis) and the discovery of new important temporal patterns such as 
continuous upregulation over time (see Section 3). 

2. Hierarchical state space model. 

2.1. Hierarchical state space model for the CDS T-cell experiment. In 
the CDS T-cell experiment, expression levels were measured for G= 12,48S 
genes at T = 4 time points where large G and small T are typical of many 
time course microarray studies. Replicates were independently sampled with 
three replicates at day 15 and four replicates at each of the other three time 
points. We denote by Xgtf^ the observed gene expression level (log base 2 
scale) for gene g at the tth time point on array k for k = 1, . . . ,nt and denote 
by Xgt = {xgti , . . . , Xgtnt ) thc observed gene expression levels for gene g at 
the tth time point. The observed gene expression levels can be represented 
by a 12,4S8 x 15 matrix whose rows represent genes and columns represent 
arrays. The typical data layout is shown in Table 1. 

Our interest lies in the gene-specific mean expression levels over time, a 
latent mean sequence denoted by /i^ = (/igi, . . . , /^^t) where ^Xgi = E{xgtk)- 

Table 1 

Data layout for microarray experiments with one biological group and independent 

sampling 



Time 1 Time 2 ... Time T 

1 ... ni 1 ... n2 ... 1 ... ?iT 

Gene 1 arm . . . Siinj 2:121 • ■ . a;i2„2 • ■ • ^iri ■ ■ ■ xittit 

Gene 2 a::2ii . . . X2ini X221 ■ ■ ■ a::22n2 • ■ • X2T1 ■ ■ ■ X2TnT 

Gene G xgu ■ ■ ■ xaini xg2i ■ ■ ■ XG2n2 ■ ■ ■ xgti ■ ■ ■ xgtut 
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Fig. 1. j4n example of three temporal patterns with independent sampling and illustration 
of the first-order hierarchical state space model: In the left panel of the figure, the y-axis 
represents the gene expression level (log base 2 scale) and the x-axis represents the time 
points. Solid lines represent the underlying mean gene expression levels (log base 2 scale) 
over time and stars represent the observed gene expression levels of three genes m an 
experiment with four replicates at each time point. 



Specifically, we are interested in the magnitude of successive changes in 
a latent sequence composed by 5gt = Hgt — /^g{t-i) for t = 2, . . . , T, and the 
direction of successive changes in fig, a latent state sequence denoted by = 
{sgi, . . . , SgT)- Here the state at the first time point is defined as "start" and 
the state at each of the other time points could be upregulation (+ : 5gt > 0), 
downregulation (— : 5gt < 0), or no change (=: 5gt = 0). This definition leads 
to a total of 3^~^ possible temporal patterns for an experiment with T time 
points (denote by S) and thus a total of 27 possible temporal patterns for 
the CDS T-cell experiment. The left panel of Figure 1 shows an example 
with three temporal patterns in an experiment with 4 time points. The 
goal of detecting TDE genes is equivalent to finding genes with either the 
upregulation or downregulation state at any time point after the first time 
point. The goal of characterizing TDE genes can be achieved by finding all 
distinctive temporal patterns. 

For the CDS T-cell experiment, we develop a hierarchical state space 
model in three levels (the observation level, the mean level and the state 
level) as follows. In the observation level, the observed gene expression lev- 
els at time t (xgt) are regarded as a random sample independently produced 
from a Normal distribution with mean figt and a common standard devia- 
tion fj, similar to Kendziorski et al. (2003). In the mean level, the successive 
change of the mean gene expression level under the state of no change is by 
definition and the successive change of the mean gene expression level under 
the upregulation (downregulation) state is regarded as a random sample in- 
dependently produced from an unknown 0-left-truncated (0-right-truncated) 
Normal distribution. In the state level, the state sequence is regarded as a 
random sample independently produced from an unknown discrete Markov 
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process in which various orders of Markov dependence are allowed, similar to 
Yuan and Kendziorski (2006). The order of the proposed hierarchical state 
space model is defined by the order of the Markov process in the state level. 
See the right panel of Figure 1 for an illustration of the first-order hierar- 
chical state space model. The proposed model is established in an empirical 
Bayes framework to utilize information from the large gene population. The 
marginal distribution of Xg = (x^i, . . . ,Xgr) in such case is 

/(^s) = J2 ^"^(^9 = v)/(x9|sg = v), 

a mixture of the conditional distributions of x^ given state series v. Here 
Pr(sg = v) = Pr(v) is the population proportion of the state series v with 
the constraint EvgsPi'Iv) = 1; fi^gl^g = v) = J f{xg\ng)f{fig\sg = w)dHg 
is the conditional distribution of x^ given state series v. To simplify notation, 
/(•|-) and /(•) are used throughout to denote a generic conditional and 
marginal density function respectively. Likewise, Pr(-|-) and Pr(-) are used 
throughout to denote a generic conditional and marginal probability function 
respectively. 

To fit the proposed hierarchical state space model, we need to estimate pa- 
rameters specifying the Normal distribution in the observation level, param- 
eters specifying the truncated Normal distributions in the mean level, and 
parameters specifying the Markov process in the state level where all these 
parameters (denoted by 0) are universal for all genes. Model fitting proceeds 
by an implementation of the EM algorithm [Dempster, Laird and Rubin 
(1977)] to produce estimates of fixed effects 6 and posterior distributions for 
latent mean sequences ^ig and latent state sequences Sg (see Appendix A) . 

2.2. Inference. After the model fitting procedure described above, in- 
ference ultimately utilizes the gene-specific posterior probabilities of state 
sequences 

(1) Pr(sg = vjxg) =Pr(sg = v)/(x3|S(, = v)//(xg), vGcS 

(a gene-specific vector of the posterior probabilities of the 27 state sequences 
in the case of the CDS T-cell data) and the gene-specific posterior distribu- 
tion of the mean sequence 

(2) /(/^g|Xg) = ^ Pr(Sg = w\Xg)f{^lg\Xg,Sg = v), 

a mixture of the conditional posterior distribution over all possible state 
sequences. From (1) an optimal state sequence may be obtained either 
by separately maximizing the marginal posterior probability at each time 
point (MMP), or by maximizing the joint posterior probability over all 
time points (MJP). A gene for which the optimal state sequence entails 
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some temporal changes is a candidate for a short hst of interesting, tempo- 
rally altered genes. The false discovery rate for such a list may be derived 
from the posterior probabilities themselves, as shown in Appendix B [see 
Benjamini and Hochberg (1995), Efron et al. (2001), Storey (2002, 2003), 
Dudoit, Shaffer and Boldrick (2003) and Newton et al. (2004) for further 
details about the false discovery rate]. From (2) a posterior summary statis- 
tic such as the posterior mean can be obtained to summarize /x^. 

3. Results. 

3.1. Analysis of the CDS T-cell experiment. We re-analyzed the CDS 
T-cell experiment with the proposed hierarchical state space model to iden- 
tify and to cluster genes involved in the development of memory CDS T 
cells. Background correction, normalization and probe-set summaries were 
performed on the Affymetrix chip data using RMA from the Bioconductor 
affy package [Irizarry et al. (2003)]. Expression values were obtained on the 
log base 2 scale. Here data were analyzed using both a first-order hierarchi- 
cal state space model (first-order HST) and a full-order hierarchical state 
space model (full-order HST). MMP and MJP were each applied to the 
gene-specific posterior probabilities of state sequences separately to obtain 
the gene-specific optimal state sequence. 

Plots of the estimated population distributions for successive changes un- 
der the upregulation state and the downregulation state at each time point 
after the first time point are shown in Figure 2. Here dashed (solid) lines 
represent the estimated population distribution for successive changes under 
the upregulation (downregulation) state. The first-order HST model and the 
full-order HST model provide similar population distribution estimates. 

Table 2 presents the estimated initial state probability vector and the state 
transition matrices of the discrete Markov process from the first-order HST 
model, suggesting a high correlation across time. In detail, compared to genes 
with no change at the previous time point, genes with change at the previous 
time point (state ="— " or "+") tend to have a much higher probability 
of change at the current time point. This pattern indicates that utilizing 
the sequential information has the potential of increasing the sensitivity of 
detecting TDE genes. The increase in sensitivity is also demonstrated in a 
simulation study following the analysis of the CDS T-cell data. In addition, 
most genes that change in the first period (Naive to dS) tend to have an 
opposite pattern in the second period (dS to dl5) where Pr(sg3 = "— " |sg2 = 
"+" = 0.68 and Pr(sg3 = "+" \sg2 = "-") = 0.72. This pattern indicates that 
utilizing the sequential information from the population has the potential of 
reducing the misclassification rate. 

Table 3 shows the number of genes identified as TDE genes using differ- 
ent models and different optimality criteria (MMP and MJP). These four 
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Fig. 2. Estimated population distributions for successive changes under the upregula- 
tion and the downregulation state over time. Here dashed (solid) lines represent the esti- 
mated population distribution for successive changes under the upregulation ( downregula- 
tion) state. 

Table 2 

Estimated parameters of the first-order Markov process in the state level 



Naive to d8 Transition from d8 to dl5 Transition from dl5 to Imm 





d8\dl5 


+ 






dl5\Imm 


+ 






0.04 0.08 0.88 


+ 


0.09 


0.68 


0.23 


+ 


0.33 


0.16 


0.51 






0.72 


0.00 


0.28 




0.00 


0.18 


0.82 






0.00 


0.00 


1.00 




0.00 


0.00 


1.00 



approaches yield similar numbers. More genes are identified as TDE in the 
earlier period than in the later period. Given the same model, conducting 
inference using the MMP results in more genes identified as TDE genes 
compared to conducting inference using the MJP. Given the same optimal- 
ity criteria, more genes were identified as TDE genes under the full-order 
HST model than under the first-order HST model. 

Clustering results from the first-order HST model are shown in Figure 3. 
The first-order and the full-order HST models produce very similar results 
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Table 3 

Number of genes identified as TDE genes between each two adjacent time points under 
different models (first-order and full- order HST models) and different optimality criteria 

(MMP and MJP) 



Model 


Criteria 


d8 


dl5 


Imm 


Across time series 


First-order model 


MMP 


1318 


951 


328 


1318 


MJP 


1244 


951 


351 


1244 


Full-order model 


MMP 


1310 


803 


450 


1342 


MJP 


1215 


815 


444 


1245 




in the state level. Using the MMP, for example, the first-order HST model 
identifies 13 patterns and the full-order HST model identifies 15 patterns 
with a total of 11 patterns in common. With respect to all the genes, about 
97.7% of the genes share the same optimal state series in these two models. 
Among genes identified as TDE in either of these models (1318 genes for 
first-order HST model and 1342 genes for full-order HST model), 1308 genes 
are in common. For these 1308 genes, about 81.5% share the same optimal 
state series, about 98.3% share the same optimal state at time point 2, 
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about 87.7% share the same optimal state at time point 3, and about 92.0% 
share the same optimal state at time point 4, indicating a good consistency 
between the first-order and the full-order HST models. 

We compared our results from the first-order HST model with the MMP 
criteria to the results obtained from the original analysis using the K-means 
clustering algorithm [Kaech et al. (2002a)]. Both methods identify clusters 
"start, +,—, —," "start, —, -|-, +," and "+,=,=." Information on these clus- 
ters is shown below: 

Cluster "start, -|-, —, —": Upregulated at day 8 and gradually downregu- 
lated at day 15 and memory stage. This cluster contains 46 genes includ- 
ing: genes related to T cell effector functions such as GZMA, GZMB and 
GZMK; genes related to cell adhesion and migration such as CCR2; genes 
related to membrane proteins such as KLRGl. 

Cluster "start, —, -|-, -|-": Downregulated at day 8 and gradually upregu- 
lated at day 15 and memory stage. This cluster contains 192 genes in- 
cluding genes related to T cell signal transduction such as IL7R; genes 
related to apopotosis/survival such as BCL2; genes related to cell adhe- 
sion and migration such as CXCR4 and CD62L. 

Cluster "start , -|-, =, = ": Upregulated at day 8 with no change over the 
following time points. This cluster contains 282 genes including genes 
related to cell adhesion and migration such as CD44 and genes related to 
membrane proteins such as LY6A. 

Our method also identifies new clusters including the following: 

Cluster "start, -|-, -|-, -|-": Continuously upregulated at all time points after 
the first time point and involved in the differentiation of memory CD8 
T cells, this cluster of genes can be used to differentiate naive, effector 
and memory CD8 T cells. This cluster contains 8 genes including genes 
related to T cell effector functions such as IFNg and genes related to cell 
adhesion and migration such as CXCR3. 

Cluster "start, -|-, —, = ": Upregulated at day 8, downregulated at day 15, 
and no change at memory stage. This cluster contains 282 genes including 
genes related to T cell effector functions such as FASL and genes related 
to cell adhesion and migration such as CCR5. 

In addition, no important biomarkers fall into the cluster with pattern 
"start, —,+, —." Examining genes that fall into this cluster, we found that 
few genes have function or pathway related to the immune response, ac- 
cording to currently available annotation information from Gene Ontology 
[Vinayagam et al. (2004)]. These observed changes may be just due to ran- 
dom variation. Compared to previous clusters obtained by the K-means clus- 
tering, our model identified more clusters with clearer and more meaningful 
patterns. Based on clusters produced from our model, the investigators may 
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more readily identify patterns involved in the CDS T cell differentiation and 
focus on genes with patterns similar to important biomarkers. 

3.2. Simulation. In addition to the analysis of the CDS T-cell data, a 
simulation study was carried out to evaluate the performance of the pro- 
posed model. We generated 100 data sets each with 4,000 genes and four 
replicates at each of 4 time points. Data were produced from the first-order 
HST model in the order of state series, mean gene expression series and 
gene expression levels. We simulated the data using parameters estimated 
from the CDS T-cell data to mimic the real scenario. First, we draw state 
series independently from a first-order Markov process (see Table 2). Then, 
we draw mean gene expression levels at the first time point independently 
from a Normal distribution and draw relative changes under the upregu- 
lation (downregulation) state independently from the 0-left-truncated (0- 
right-truncated) Normal distribution (see Figure 2). In turn, observed gene 
expression levels were independently produced from a Normal distribution 
conditional on the mean gene expression levels. Simulated data sets were 
analyzed using the following three methods: (I) a first-order HST model, 
(II) a zero-order HST model (independent HST model), and (HI) a pairwise 
analysis method. Here the pairwise analysis was done by comparing each of 
the three successive changes in mean gene expression levels separately, ig- 
noring changes at the other periods. Model parameters were estimated using 
the EM algorithm and the MMP criterion was applied to obtain the gene- 
specific optimal state sequence. Performance of these methods was assessed 
in both the discrete state level and the continuous mean level. 

In the state level, we evaluated these three methods by comparing speci- 
ficity, sensitivity, the false discovery rate, and the misclassification rate of 
the states at each time point along with the misclassification rate of the 
state series. Here the misclassification rate of the states (state series) is the 
proportion of genes whose optimal state (state series) does not match the 
true state (state series). Define the state of no change ("=') as the tem- 
porally nondifferentially expressed (TNDE) state and the other two states 
("+" or "— ") as the temporally differentially expressed (TDE) state. At 
each time point after the first time point, the sensitivity is the proportion of 
TDE states that are correctly identified as TDE, the specificity is the pro- 
portion of TNDE states that are correctly identified as TNDE, and the false 
discovery rate is the proportion of false TDE states among those identified 
as TDE. 

Table 4 and Figure 4 summarize the simulation results in the state level. 
The first-order HST model has a lower misclassification rate than the inde- 
pendent HST model, which in turn has a lower misclassification rate than 
the pairwise method. Noticeably, the first-order HST model reduces the mis- 
classification rate of state series by 32% (49%) compared to the independent 
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Table 4 

Summary of simulation results in the state level. Here I represents the first-order HST 
model, II represents the independent HST model, and III represents the pairwise method. 
MR is the misclassification rate of states at each time point and SMR is the 
misclassification rate of state series 



Method Time 1 to Time 2 Time 2 to Time 3 Time 3 to Time 4 





I 


0.158 


(0.026) 


0.091 


(0.020) 


0.119 (0.036) 


FDR 


II 


0.197 


(0.031) 


0.066 


(0.022) 


0.164 (0.047) 




III 


0.191 


(0.027) 


0.397 


(0.038) 


0.506 (0.047) 




I 


0.760 


(0.024) 


0.678 


(0.030) 


0.683 (0.043) 


Sensitivity 


II 


0.639 


(0.024) 


0.407 


(0.025) 


0.581 (0.044) 




III 


0.667 


(0.024) 


0.556 


(0.027) 


0.647 (0.042) 




I 


0.980 


(0.004) 


0.993 


(0.002) 


0.997 (0.001) 


Specificity 


II 


0.979 


(0.004) 


0.997 


(0.001) 


0.996 (0.002) 




III 


0.978 


(0.004) 


0.964 


(0.006) 


0.976 (0.005) 




I 


0.047 


(0.003) 


0.035 


(0.003) 


0.014 (0.002) 


MR 


II 


0.063 


(0.004) 


0.056 


(0.002) 


0.019 (0.002) 




III 


0.059 


(0.004) 


0.073 


(0.005) 


0.035 (0.004) 




I 


0.069 


(0.004) 


0.069 


(0.004) 


0.069 (0.004) 


SMR 


II 


0.102 


(0.004) 


0.102 


(0.004) 


0.102 (0.004) 




III 


0.136 


(0.008) 


0.136 


(0.008) 


0.136 (0.008) 



HST model (the pairwise analysis). In addition, the first-order HST model 
increases the sensitivity by as much as 22% compared to the pairwise anal- 
ysis. Compared to the pairwise analysis, the hierarchical state space model 
helps to improve the specificity, reduce the false discovery rate, and reduce 
the misclassification rate. Using a first-order HMM structure in the state 
level in the first-order HST model helps to reduce the misclassification rate 
and improve the sensitivity. In summary, these results indicate that the first- 
order HST model outperforms the other two methods in the state level. In 
addition, we investigate the performance of our model in the mean level. 
Using the mean from the posterior distribution of the mean gene expres- 
sion level obtained from (2) in the first-order HST model as the summary, 
we find that posterior mean gene expression level reduces the variance and 
mean square error by 53% (from 0.019 to 0.009) compared to the sample 
mean gene expression level. 

4. Discussion. In this article we analyzed a short time course microar- 
ray experiment to investigate the differentiation of memory CDS T cells. To 
understand the differentiation of memory CDS T cells, we need to detect 
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Fig. 4. Summary of simulation results in the state level. 

TDE genes, identify the direction of successive changes over time, and char- 
acterize the magnitude of successive changes over time. Considering this, 
we develop a hierarchical state space model. To improve inference, a HMM 
structure is exploited to utilize the correlation of states across time and an 
empirical Bayes framework is utilized to borrow information from the large 
gene population. Results from the CDS T-cell data indicate a strong correla- 
tion over time in the state level and an empirical Bayes analysis helps us to 
capture this theme from the large gene population (see Table 2 for details). 
Results from a simulation study show that incorporating this correlation 
may help to increase the sensitivity of detecting TDE genes and to reduce 
the misclassification rate (see Table 4 for details). 

To analyze the CDS T-cell experiment, we developed a specific hierarchical 
state space model. The proposed model can also be easily applied to other 
short time course microarray experiments which share a similar structure 
to the CDS T-cell experiment (one biological condition and independent 
sampling). In addition, the specifications of the proposed model could be 
adjusted to suit different situations. For example, in the observation level, 
we could allow the sampling variance to vary over time instead of being 
time-independent. In the mean level, we could use a Gamma distribution, 
instead of a truncated Normal distribution, to model the successive changes 
over time. More details can be found in Wu (2007). 
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The proposed model can be extended in a straightforward manner to ex- 
periments with a longitudinal sampling design where subjects involved in 
experiments have repeated observations over time. For example, in a study 
of monkeys infected with simian immunodeficiency virus, the same group 
of monkeys were observed over time after infection [see Wu (2007) for more 
details]. The proposed hierarchical state space model could also be extended 
to time course microarray experiments with long time series. However, since 
the number of possible state series increases exponentially with the num- 
ber of the time points (3"^~^), inference using the EM algorithm would be 
computationally intensive. Updating the posterior distribution using Monte 
Carlo simulation with a sequential strategy [Liu and Chen (1998)] could be 
a promising approach for analyzing time course microarray data with long 
time series. 

Analysis of the CDS T-cell data using the proposed hierarchical state 
space model produces biologically meaningful results. With explicit optimal 
state series, genes can be easily grouped and studied, either marginally or 
jointly. Important temporal patterns involved in biological processes can be 
quickly identified and genes that have a pattern similar to known important 
biomarkers can be easily extracted for further investigation. These results 
may also be used for further high level analysis, such as network and path- 
way reconstruction. Analysis of the immune response suggests that multiple 
pathways are involved in the differentiation of memory CDS T cells, in- 
cluding cell adhesion/migration, signal transduction and effector response. 
Here we want to emphasize that this analysis is conducted at the genetic 
level. Providing rich insight into biological processes, these results also re- 
quire careful interpretation to answer questions of interest. For example, a 
high level of gene expression does not necessarily yield a high level of its 
functional protein. In immune response experiments, the gene expression 
levels for some genes related to antigen-killing functions such as GZMB and 
IFNG are higher in memory CDS T cells than in naive CDS T cells. How- 
ever, there is no corresponding high production of effector response-related 
proteins. This amazing strategy enables memory CDS T cells to have a 
quick recall response without releasing improper cytotoxic proteins harming 
healthy cells [Veiga-Fernandes et al. (2000)]. It is desirable to combine in- 
formation from other biological studies to obtain a complete understanding 
of the differentiation of memory CDS T cells in response to a viral infection. 



Under the gene- wise independence assumption, the complete log-likelihood 

is 



APPENDIX A: MODEL FITTING 
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G G 
= Y^\ogf{-^g\tig) + E E ^[^9 = v][log/(/xJs, = v) + logPr(v)]. 

9=1 9=1 ve5 

One may question this gene-wise independence assumption. However, con- 
sidering that parameter estimates in the proposed model are driven largely 
by the marginal distribution of the data, parameter estimates in a large- 
scale study like the microarray study ought to be reliable (though probably 
estimated less well than expected) in the presence of modest among-gene 
dependence. Treating (/X]^, . . . , ^q) and (si, . . . , sq) as missing data, we em- 
ploy the EM algorithm to find the MLE iteratively. Here we use the proposed 
parametric first-order HST model to illustrate the fitting procedure. Given 
thousands of genes in the experiments, can be reliably estimated using 
the usual unbiased estimator 

J G T nt 

where Xgt is the sample mean of the gene expression level for gene g at the 
tth time point. 

In the expectation step, we can estimate Pr(sg = v|xg) by 

I \ ^^(Sg = v)/(Xg|Sg = V) 

Pr(Sg=v|Xg) = — , V G 5. 

From Pr(sg|xg), we can estimate Pr(Sgt = i|xg) by X]vt=i Pi'Csg = v|xg) and 
Pr(sg(t-i) =i,Sgt= j\yig) by E^^_^=j^^^=j Pr(sg = v|xg) for z, j = +,-, = and 
t = 2, . . . ,T. In the maximization step, the initial state probabilities and the 
state transition matrices can be updated by 

_ 1 ^ ^ 

TTi = Pr(S't = i) = — E Pr(sgi = «|Xg), 

9=1 

= Pi'('S't-i = ^> •S't = j) = — E Pr(s9{t-i) = h Sgt = j|Xg) 

9=1 

and model parameters in the mean level can be updated by maximizing 

G 

n H P^{Sgt=i\^g)f{Xg{t~l),Xgt\Sgt = i). 

9=1 «=+-,= 

Iterate between the expectation step and the maximization step until con- 
vergence. 
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To focus on the successive changes over time, a simple center transforma- 
tion, Xgtk = Xgtk — Xgi, cdn used to simphfy the proposed model. After 
the center transformation, the proposed model can be rewritten as 

Xgtk = fj'gt + £gtk, 

flgi^ N{0,a^/ni), 
fj-gt = /ig(i-l) + ^gt, 

with other parts of the proposed hierarchical state space model unchanged. 
Here Xgtk represents the gene expression level after the center transforma- 
tion and figt represents the latent mean after the center transformation. 
Centering may induce some dependence as discussed in Dahl and Newton 
(2007). 



APPENDIX B: CONDITIONAL FDR 

Denoting by Ct the set of genes identified as TDE at each time point and 
denoting by C the combined set of genes identified as TDE, the expected 
FDR at each time point can be estimated by 



FBRt 



E^=iFr(.g.= "="|xg)/(g£a) 



the average of the conditional false discovery rate at time t for genes in Ct- 
Similarly, the overall expected FDR can be estimated by 

J2^=i PrCno change over time|xg)I((7 € C) 



FDR: 



E^=i/(5GC) 



the average of the conditional false discovery rate over all time points for 
genes in C [see Newton et al. (2004) for further elaboration on this point]. 
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